Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus

Authors

  • Peter D. Turney
  • Michael L. Littman
Abstract

The evaluative character of a word is called its semantic orientation. A positive semantic orientation implies desirability (e.g., “honest”, “intrepid”) and a negative semantic orientation implies undesirability (e.g., “disturbing”, “superfluous”). This paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. The method involves issuing queries to a Web search engine and using pointwise mutual information to analyse the results. The algorithm is empirically evaluated using a training corpus of approximately one hundred billion words, namely the subset of the Web that is indexed by the chosen search engine. Tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80%. The 3,596 test words include adjectives, adverbs, nouns, and verbs. The accuracy is comparable with the results achieved by Hatzivassiloglou and McKeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives.
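The abstract names the ingredients (search-engine queries and pointwise mutual information) but does not spell out the computation. As a rough illustration only, the sketch below scores a word by its summed PMI with a few positive seed words minus its summed PMI with a few negative seed words. The seed words, the toy counts standing in for search-engine hit counts, the smoothing constant, and the function names are all assumptions made for this example, not details taken from the abstract.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information in bits: log2(p(a, b) / (p(a) * p(b)))."""
    return math.log2((joint / total) / ((count_a / total) * (count_b / total)))


def semantic_orientation(word, counts, joint_counts, total,
                         pos_seeds, neg_seeds, smoothing=0.01):
    """Association with positive seeds minus association with negative seeds.

    `counts` maps a word to its (hypothetical) hit count, and `joint_counts`
    maps a frozenset pair of words to their co-occurrence count.
    `smoothing` (an assumed value) keeps log2 defined when a pair never co-occurs.
    """
    def assoc(seed):
        joint = joint_counts.get(frozenset((word, seed)), 0) + smoothing
        return pmi(joint, counts[word], counts[seed], total)

    return sum(assoc(s) for s in pos_seeds) - sum(assoc(s) for s in neg_seeds)


# Toy numbers standing in for search-engine hit counts (purely illustrative).
TOTAL = 1_000_000
COUNTS = {"honest": 1000, "disturbing": 1000, "good": 10000, "bad": 10000}
JOINT = {
    frozenset(("honest", "good")): 80,
    frozenset(("honest", "bad")): 10,
    frozenset(("disturbing", "good")): 10,
    frozenset(("disturbing", "bad")): 80,
}

so_honest = semantic_orientation("honest", COUNTS, JOINT, TOTAL, ["good"], ["bad"])
so_disturbing = semantic_orientation("disturbing", COUNTS, JOINT, TOTAL, ["good"], ["bad"])
print(so_honest, so_disturbing)
```

With these toy counts, "honest" receives a positive score and "disturbing" a negative one; a word would then be labelled positive or negative by the sign of its score. In a real setting the counts would come from queries to a search engine rather than from a hand-built dictionary.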

Similar resources

Inducing Example-based Semantic Frames from a Massive Amount of Verb Uses

We present an unsupervised method for inducing semantic frames from verb uses in giga-word corpora. Our semantic frames are verb-specific example-based frames that are distinguished according to their senses. We use the Chinese Restaurant Process to automatically induce these frames from a massive amount of verb instances. In our experiments, we acquire broad-coverage semantic frames from two g...

A Step-wise Usage-based Method for Inducing Polysemy-aware Verb Classes

We present an unsupervised method for inducing verb classes from verb uses in gigaword corpora. Our method consists of two clustering steps: verb-specific semantic frames are first induced by clustering verb uses in a corpus and then verb classes are induced by clustering these frames. By taking this step-wise approach, we can not only generate verb classes based on a massive amount of verb use...

Distributional Semantics Approach to Thai Word Sense Disambiguation

Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are: knowledge-based, corpus-based, and hybrid. This paper pays attention to the corpus-based strategy...

Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis

This paper introduces a new large-scale n-gram corpus that is created specifically from social media text. Two distinguishing characteristics of this corpus are its monthly temporal attribute and that it is created from 1.65 billion comments of user-generated text in Reddit. The usefulness of this corpus is exemplified and evaluated by a novel Topic-based Latent Semantic Analysis (TLSA) algorit...

Journal:
  • CoRR

Volume: cs.LG/0212012

Publication date: 2002